feat(parser): adopt section-number module in refs/inference/text parsers #115
📝 Walkthrough
This PR systematizes section-number canonicalization across the parsing and inference pipeline.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
@coderabbitai review
✅ Action performed: review finished.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/infer-section.ts`:
- Line 18: The ReDoS concern is a false positive because sectionNumberFragment()
returns a fixed module-local constant FRAGMENT composed of fixed-width pieces
(e.g., \d{2}, optional \.\d{2}, optional single agency token) and is
interpolated into new RegExp(...) in infer-section.ts; update the code by adding
a concise inline comment next to sectionNumberFragment() and the new RegExp(...)
call explaining that FRAGMENT is fixed-width and contains no nested
unbounded/repeating groups, and optionally add a small unit/perf test for
infer-section.ts that runs the regex against long adversarial strings to prove
it doesn't exhibit catastrophic backtracking.
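A minimal sketch of the optional perf test suggested above. The `FRAGMENT` value here is a hypothetical stand-in built from the fixed-width pieces the comment describes; the real constant returned by `sectionNumberFragment()` may differ, and the input sizes are assumptions chosen only for illustration.

```typescript
// Hypothetical stand-in for the module-local FRAGMENT constant; the real
// pattern returned by sectionNumberFragment() may differ.
const FRAGMENT: string = String.raw`\d{2} \d{2} \d{2}(?:\.\d{2})?`;

// Interpolate the fragment into new RegExp(...), as the review comment describes.
const sectionPattern = new RegExp(`\\b(${FRAGMENT})\\b`);

// Near-miss inputs: many offsets start to look like a section number and
// then fail, which is the shape that would expose catastrophic backtracking.
const adversarialInputs: string[] = [
  "12 34 5".repeat(100000),
  "12 345 ".repeat(100000),
  "1 ".repeat(300000),
];

// Time a single match attempt, in milliseconds.
function timeMatch(pattern: RegExp, input: string): number {
  const start = Date.now();
  pattern.test(input);
  return Date.now() - start;
}

const timingsMs: number[] = adversarialInputs.map((s) => timeMatch(sectionPattern, s));
const slowestMs: number = Math.max(...timingsMs);
```

Because every quantifier in the sketched fragment is bounded, each start offset costs a constant amount of work, so `slowestMs` should stay far below any reasonable test timeout. A real test would import the actual fragment instead of redefining it.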
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: a7f86d70-a57d-4ac5-9819-058461614eac
📒 Files selected for processing (10)
- src/lib/infer-section.test.ts
- src/lib/infer-section.ts
- src/parser/refs/extract.test.ts
- src/parser/refs/extract.ts
- src/parser/refs/rules.test.ts
- src/parser/refs/rules.ts
- src/parser/sec/index.test.ts
- src/parser/sec/index.ts
- src/parser/text/index.test.ts
- src/parser/text/index.ts
Summary
Stacked on #114. Fixes the silent suffix-truncation data corruption across all content-scanning parsers by adopting `sectionNumberFragment()`/`normalizeSectionNumber()`:
- `parser/refs` (prose citations): `See Section 26 00 13.10` no longer stores a wrong link to base `26 00 13`; the suffix is captured and normalized.
- `lib/infer-section`: keyword scan keeps `.33` (`01 33 23.33` ≠ `01 33 23`), bare suffixed headers infer instead of returning `none`, inline titles after suffixed numbers are extracted again, and dash-separated inline titles now strip the dash (parity with the text parser).
- `parser/text`: `.txt` headers keep suffixed sections AND their dash titles (previously `SECTION 27 05 13.43 - TITLE` truncated the section and lost the title).
- `parser/sec`: SCN/SRF values normalize-or-verbatim: whitespace/NBSP dirt is canonicalized and tagged values are never rejected (3 corpus files carry SCN dirt). Internal-dirt normalization is mutation-test pinned.
Test Plan
- `pnpm test src/parser/ src/lib/infer-section.test.ts`: regression tests named by symptom (e.g. `'infer-section: keyword scan keeps .33 — 01 33 23.33 is not 01 33 23'`)
- `pnpm lint && pnpm test`: 564 unit tests green at this commit; pre-existing suffix pins (`27 05 13.43`) untouched
Out of Scope
This PR does NOT tighten any validation gate (API schemas/worker gate land in PR 3; DB CHECK in PR 4). Parser outputs for conforming inputs are unchanged; dirty inputs are strictly improved.
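For readers without the diff open, the behaviors listed in the Summary can be sketched roughly as follows. Everything here is hypothetical: the function names, the regexes, and the return shapes are illustrative assumptions, not the PR's `normalizeSectionNumber()` or the parsers' actual code.

```typescript
// Hypothetical canonical shape: three 2-digit groups plus an optional
// 2-digit suffix. In JavaScript, \s already matches NBSP (U+00A0).
const SECTION_NUMBER = /^(\d{2})\s+(\d{2})\s+(\d{2})(?:\.(\d{2}))?$/;

// Returns the canonical "NN NN NN[.NN]" form, or null if the input does not fit.
function normalizeSectionNumberSketch(raw: string): string | null {
  const m = SECTION_NUMBER.exec(raw.trim());
  if (m === null) return null;
  const base = `${m[1]} ${m[2]} ${m[3]}`;
  // Keep the suffix: "01 33 23.33" must not collapse to "01 33 23".
  return m[4] !== undefined ? `${base}.${m[4]}` : base;
}

// Normalize-or-verbatim: a tagged value is never rejected, only cleaned
// up when it can be.
function normalizeOrVerbatim(raw: string): string {
  return normalizeSectionNumberSketch(raw) ?? raw;
}

// Parse a .txt-style header, keeping the suffix and stripping the dash
// in front of an inline title.
function parseHeaderSketch(line: string): { section: string; title: string } | null {
  const header = /^SECTION\s+(\d{2}\s+\d{2}\s+\d{2}(?:\.\d{2})?)(?:\s*-\s*(.+))?$/;
  const m = header.exec(line.trim());
  if (m === null) return null;
  const section = normalizeSectionNumberSketch(m[1]);
  if (section === null) return null;
  return { section, title: (m[2] ?? "").trim() };
}
```

Under these assumptions, `parseHeaderSketch("SECTION 27 05 13.43 - TITLE")` yields section `27 05 13.43` with title `TITLE`, and `normalizeOrVerbatim` returns an unparseable tagged value unchanged instead of dropping it.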